Abstract

Recognising persons in everyday photos presents major challenges
(occluded faces, different clothing, locations, etc.) for machine
vision. We propose a convnet based person recognition system on which
we provide an in-depth analysis of informativeness of different body
cues, impact of training data, and the common failure modes of the
system. In addition, we discuss the limitations of existing benchmarks
and propose more challenging ones. Our method is simple and is built
on open source and open data, yet it improves the state of the art
results on a large dataset of social media photos (PIPA).


1. Introduction 

Person recognition in private photo collections is challenging:
people can be shown in all kinds of poses and
activities, from arbitrary viewpoints including back views, 
and with diverse clothing (e.g. on the beach, at parties, etc., 
see Figure 1). This paper presents an in-depth analysis of 
the problem of person recognition in photo albums: given 
a few annotated training images of a person (possibly from 
different albums), and a single image at test time, can we 
tell if the image contains the same person? 

Intuitively, the ability to recognize faces in the wild [22]
is an important ingredient. However, when persons are engaged
in an activity (i.e. not posing) their face becomes only
partially visible (non-frontal, occlusion) or simply fully
non-visible (back-view). Therefore, additional information
is required to reliably recognize people. We explore three
other sources: first, the body of a person contains information
about their shape and appearance; second, human attributes
such as gender and age help to reduce the search space; and
third, scene context further reduces ambiguities.

The main contributions of the paper are the following. 
First, we provide a detailed analysis of performance of 
different cues (§3). Second, we propose more realistic
and challenging experimental protocols over PIPA (§5.1)
on which a deeper understanding of robustness of different 
cues can be attained (§5.2). Third, in the process, we obtain 
best results on the recently proposed PIPA dataset and show 

Figure 1: Person recognition in photo albums is hard. To
handle the diverse scenarios we need to exploit multiple
cues from different body regions and information sources.
Photos show test cases successfully recognised by our system;
ticks indicate which ingredient could handle it. For
example, the surfer is not recognised when using only head
or head+body cues. However, it is successfully recognised
when the additional attribute cues are provided.


that previous performance can be matched without specialized
face recognition or pose estimation (§4). Fourth, we
analyse remaining failure modes (§5). Additionally, our 
top-performing method is based only on open source code 
and data, and the new experimental setups (§5.1), trained 
models, results, and attribute annotations are available at 
http://goo.gl/DKuhlY. 

1.1. Related work 

Data type The bulk of previous work on person recognition
focuses either on facial features [22] (only the head/face
is visible), or on the surveillance scenario [3, 2] (full 
body is visible, usually in low resolution). Both settings 
have seen a recent shift from sophisticated classifiers based 
on hand-crafted features and metric learning approaches 
[20, 7, 5, 30, 27, 42, 1], towards methods based on deep 
learning [38, 37, 44, 34, 28, 39, 21]. 

In this paper we tackle a different scenario, where persons
may appear at different zoom levels (e.g. only head,
upper torso, full body visible), and in any pose (e.g. sitting,
running, posing), and from any point of view (e.g. front,
side, back view), see Figures 1 and 7. The “Gallagher collection
person dataset” [15] was the first dataset covering
this scenario; however, it is quite small (~600 images, 32
identities) and only frontal faces are annotated. We build
our paper upon the recently introduced PIPA dataset [41]
which is two orders of magnitude larger (~40k images, ~2k
identities), more diverse, and also provides identity annotations
when the face is not visible. We describe PIPA in
more detail in §2.

Recognition tasks There exist multiple tasks related to
person recognition [19] differing mainly in the amount
of training and testing data. Face and surveillance
re-identification is most commonly done via “verification”
(one reference image, one test image; do they show the
same person?) [22, 2]. The scenario of our interest is ~20
training images and one test image.

Other related tasks are, for instance, face clustering [9, 
34], finding important people [31], or associating names in 
text to faces in images [13, 14]. 

Recognition cues The base cue for person recognition 
is the appearance of the face itself. Face normalization 
(“frontalisation”) [45, 38, 12] improves robustness to pose, 
view-point and illumination. Similarly, pose-independent 
descriptors can be built for the body [8, 17, 41]. 

Multiple other cues have been explored, for example: 
attributes classification [25, 26], explicit cloth modelling 
[15], relative camera positions [18], social context [16, 36], 
space-time priors [29], and photo-album priors [35]. 

The PIPA dataset was introduced together with the reference
PIPER method [41]. PIPER obtains promising results
combining three ingredients: a convnet (AlexNet [24])
pre-trained on ImageNet [10], the DeepFace re-identification
convnet (trained on a large private faces dataset) [38], and
Poselets [4] (trained on H3D) to obtain robustness to pose
variance. In contrast, this paper considers features based on
open data and uses the same AlexNet network for all the image
regions considered, thus providing a direct comparison
of contributions from different image regions.

2. PIPA dataset 

The recently introduced PIPA dataset (“People In Photo
Albums”) [41] is, to the best of our knowledge, the first
dataset to annotate identities of people with back views.
The annotators labelled many instances that can be considered
hard even for humans (Figure 7). PIPA features 37 107
Flickr personal photo album images (Creative Commons),
with 63 188 head bounding boxes of 2 356 identities. The
dataset is partitioned into train, validation, test, and leftover
sets, with rough ratio 45:15:20:20. Up to annotation errors,
neither identities nor photo albums by the same uploader
are shared among these sets.

For valid comparisons, we follow the PIPA protocol in
[41]. The training set is used for feature learning and the
validation set for exploring and optimising options. The
test set is for evaluation of our methods (Table 4); it is itself
split in two parts, test0/test1, with roughly the same number
of instances per identity. Given test0 a classifier is learnt
for each identity (11 examples per identity on average), and
these are evaluated on test1 (and vice-versa). Later we consider
more challenging splits than the PIPA default (§5.1).
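
The two-way protocol above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the names `accuracy`, `two_fold_accuracy` and the `fit` callback are hypothetical, and averaging the two directions is an assumption (the text only says that the roles of the two halves are swapped).

```python
from typing import Callable, Sequence, Tuple

Sample = Tuple[Sequence[float], str]          # (feature vector, identity label)
Predictor = Callable[[Sequence[float]], str]  # maps a feature to an identity


def accuracy(predict: Predictor, samples: Sequence[Sample]) -> float:
    """Fraction of samples whose predicted identity equals the label."""
    correct = sum(1 for x, y in samples if predict(x) == y)
    return correct / len(samples)


def two_fold_accuracy(
    fit: Callable[[Sequence[Sample]], Predictor],
    test0: Sequence[Sample],
    test1: Sequence[Sample],
) -> float:
    """Learn on test0 and evaluate on test1, then swap the roles.

    Averaging the two accuracies is an assumption of this sketch.
    """
    acc_01 = accuracy(fit(test0), test1)
    acc_10 = accuracy(fit(test1), test0)
    return (acc_01 + acc_10) / 2.0
```

Any per-identity classifier can be plugged in through `fit`, which keeps the protocol separate from the choice of features and classifier.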

At test time, the system is fed with the photo of the test 
instance and the ground truth head annotation (tight around 
the skull, face and hair included; not fully visible heads are 
hallucinated by the annotators). The task is to find the
corresponding identity of the head.

In the next section, various image regions and the corresponding
recognition cues are defined (§3.1), and their validation
set performances are compared (§3.3 to §3.7). The
performance of our final system and comparisons to other
methods and baselines are provided in §4. §5 will present
an in-depth analysis of the systems, including the performance
on the more realistic and challenging PIPA splits.

3. Cues for recognition

Our person recognition system is performant yet simple. At test
time, given a (ground truth) head bounding box, we estimate (based
on the box size) the five different regions depicted in Figure 2.
Each region is fed into one (or more) convnet(s) to obtain a set of
feature vectors.

The vectors are concatenated and fed into a linear SVM, trained per
identity as one versus the rest (on test0/1). In our final system
all features are computed using the seventh layer of an AlexNet [24]
pre-trained for ImageNet classification (albeit we explore
alternatives in the next sections). The cues only differ among each
other by the image region considered, and by the fine-tuning
used to alter the AlexNet model (type of data or surrogate
task).


Figure 2: Regions considered for feature extraction: face f,
head h, upper body u, full body b, and scene s. More than
one feature vector can be extracted per region (e.g. h1, h2).


Compared to PIPER [41], we merge cues with a simpler
schema and do not use specialized face recognition or
pose estimation. Instead, we explore different directions:
how informative are fixed body regions (no pose estimation)
(§3.3)? How much does scene context help (§3.4)?
And how much do we gain by using extended data (§3.6 &
§3.7)? This section is based exclusively on the validation set.

Method          Cue                   Accuracy
Chance level                           0.27
Scene (§3.4)    s                     27.06
Body            b                     80.81
Upper body      u                     84.76
Head            h                     83.88
Face (§3.5)     f                     74.45
Face+head       f + h                 84.80
Full person     P = f + h + u + b     91.14
Full image      Ps = P + s            91.16

Table 1: Validation set accuracy of different cues. More
detailed combinations in Appendix Table 5.

3.1. Image regions used 

We choose five different image regions based on the 
ground truth head annotation (given at test time, see 
§2). The head rectangle h corresponds to the ground 
truth annotation. The full body rectangle b is defined as
(3 × head width, 6 × head height), with the head at the top
centre of the full body. The upper body rectangle u is the
upper-half of b. The scene region s is the whole image 
containing the head. 
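
The region geometry described above can be written down directly. The following is a minimal sketch: `Box` and `regions_from_head` are hypothetical names, boxes are assumed to be (x, y, width, height) with the origin at the top-left of the image, and clipping to the image borders is omitted. The face region f comes from a detector and is therefore not derived here.

```python
from typing import Dict, NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle: top-left corner (x, y), width w, height h."""
    x: float
    y: float
    w: float
    h: float


def regions_from_head(head: Box, image_w: float, image_h: float) -> Dict[str, Box]:
    """Derive the fixed, pose-agnostic crops h, b, u and s from a head box.

    Crops are not clipped, so b and u may extend past the image borders.
    """
    head_cx = head.x + head.w / 2.0
    # Full body: 3 x head width, 6 x head height, head at the top centre.
    body = Box(head_cx - 1.5 * head.w, head.y, 3.0 * head.w, 6.0 * head.h)
    # Upper body: upper half of the full body rectangle.
    upper = Box(body.x, body.y, body.w, body.h / 2.0)
    # Scene: the whole image containing the head.
    scene = Box(0.0, 0.0, image_w, image_h)
    return {"h": head, "b": body, "u": upper, "s": scene}
```

For a 40 × 50 pixel head at (100, 50), this gives a 120 × 300 body box starting at (60, 50) and a 120 × 150 upper body box with the same top-left corner.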

We use a face detector to find the face rectangle f inside
each test head. We use the open source state of the art
method of [32], which also provides a rough indication of 
the head yaw rotation (frontal, 45°, 90° side view). When 
no detection matches an annotation (e.g. back views), we 
regress the face area from the head bounding box. More 
details on the performance of this detector are in §3.5. Five 
respective image regions are illustrated in Figure 2. 

Please note that the regions overlap with each other, and that
these pose-agnostic crops may not match the actual body regions.

3.2. Fine-tuning and parameters 

Unless specified otherwise AlexNet is fine-tuned using
PIPA’s person recognition training set (~30k instances,
~1.5k identities), cropped at different regions, with 300k
mini-batch iterations (batch size 50). We refer to the base
cue thus obtained as f, h, u, b, or s, depending on the crop.
On the validation set we found fine-tuning to provide a systematic
~10 percent points (pp) gain over not fine-tuned
AlexNet. Since we use the seventh layer of AlexNet, each cue
adds 4 096 dimensions to our concatenated feature vector.

We train a linear SVM classifier for each identity, with
regularization parameter C = 1. On the validation set the
SVM classifier consistently outperforms the naive nearest
neighbour (NN) classifier by a ~10 pp margin. Additional
details can be found in Appendix §G.
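
To make the cue combination concrete, the sketch below concatenates per-region feature vectors and applies a one-versus-rest decision over per-identity linear classifiers. It is an illustration under assumptions: `concatenate_cues` and `predict_identity` are hypothetical names, the weights and biases are taken as given (in the paper they would come from SVM training with C = 1, which is not shown), and short toy vectors stand in for the 4 096-dimensional features.

```python
from typing import Dict, List, Sequence, Tuple


def concatenate_cues(cues: Sequence[Sequence[float]]) -> List[float]:
    """Stack per-region feature vectors (e.g. f, h, u, b, s) into one vector."""
    out: List[float] = []
    for vec in cues:
        out.extend(vec)
    return out


def predict_identity(
    feature: Sequence[float],
    classifiers: Dict[str, Tuple[Sequence[float], float]],
) -> str:
    """One-versus-rest decision over per-identity linear classifiers.

    Each classifier is a (weights, bias) pair; the identity with the
    highest score w . x + b is returned.
    """
    def score(name: str) -> float:
        w, b = classifiers[name]
        return sum(wi * xi for wi, xi in zip(w, feature)) + b

    return max(classifiers, key=score)
```

Because the cues are simply concatenated, adding or removing a cue only changes the length of the vector; no per-cue weighting has to be learnt separately.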


Method                  Cue             Accuracy
Gist                    s_gist          21.56
PlacesNet scores        s_places205     21.44
raw PlacesNet           s0_places       27.37
PlacesNet fine-tuned    s3_places       25.62
raw AlexNet             s0              26.54
AlexNet fine-tuned      s = s3          27.06

Table 2: Validation set accuracy of different feature vectors
for the scene region s. See descriptions in §3.4.

3.3. How informative is each image region?

Table 1 shows the validation set results of each region individually
and in combination. Head and upper body are the
strongest individual cues. We discuss head and face in §3.5.
Upper body is more reliable than the full body, because we
observe that legs are commonly occluded (or out of frame)
and thus become a distractor. Scene is, unsurprisingly, the
weakest individual cue, but it still contains useful information
for person recognition (far above chance level). Importantly,
we see that all cues complement each other (despite
having overlapping pixels).

Conclusion On the validation set at least, our features and
combination strategy seem quite effective.

3.4. Scene (s) 

Other than a fine-tuned AlexNet we considered multiple
feature types to encode the scene information. s_gist: using
the Gist descriptor [33] (512 dimensions). s0_places:
instead of using AlexNet pre-trained on ImageNet, we consider
an AlexNet (PlacesNet) pre-trained on 205 scene categories
of the “Places Database” [43] (~2.5 million images).
s_places205: instead of the 4 096 dimensions PlacesNet
feature vector, we also consider using the score vector
for each scene category (205 dimensions). s0, s3: we consider
using AlexNet in the same way as for body or head
(with zero or 300k iterations of fine-tuning on the PIPA
person recognition training set). s3_places: s0_places
fine-tuned for person recognition.

Results Table 2 compares the different alternatives on the
validation set. The Gist descriptor s_gist performs only
slightly below the convnet options (4 608 dimensional version
of Gist gives worse results). Using the raw (and longer)
feature vector of s0_places is better than the class scores of
s_places205. Interestingly, in this context pre-training for
places classification is better than pre-training for objects
classification (s0_places versus s0). After fine-tuning s3
reaches a similar performance as s0_places.

Experiments trying different combinations indicate that
there is little complementarity between these features.
Since there is not a large difference between s0_places and
s3, for sake of simplicity we use s3 as our scene cue s in
all other experiments.

Conclusion Scene by itself, albeit weak, can obtain results
far above chance level. After fine-tuning, scene recognition
as pre-training surrogate task [43] does not provide a
clear gain over (ImageNet) object recognition.

3.5. Head (h) or face (f)?

A large portion of work on face recognition focuses on 
the face region specifically. In the context of photo albums, 
we aim to quantify how much information is available in the 
head versus the face region. 

The face region f is defined by a state of the art face detector
[32] (see §3.1). Since no face annotations are available
on PIPA, we validate the face detection location by
learning a linear regressor from f to h (per DPM component).
When using these head estimates (~75% of heads
replaced) instead of the ground truth head (h in Table 1),
results drop only 0.45%, thus indirectly validating that faces
are well localized.

Results When using the face region, there is a large gap of
~10 percent points in performance between f and h in Table
1, highlighting the importance of including the head region
around the face in the descriptor.

When evaluating only on the frontal faces of the validation
set (as indicated by the detector), f reaches 81% accuracy,
versus 70% on non-frontal faces. The performance drop from
frontal faces to profile and back views is less
dramatic than one could have suspected.

In comparison, on frontal faces in the test set, DeepFace
reaches ~90% [41], and returns the chance level (0.17%)
otherwise. The test set contains about 50% of non-frontal
faces. On the test set f obtains 74% and 57% for frontal and
non-frontal faces, respectively (18 pp drop), while h obtains
82% and 70%, respectively (12 pp drop).

Conclusion Using h is more effective than f, both due
to improved recognition for frontal faces and robustness to
head rotation. That being said, f results show fair performance
in recognising non-frontal faces. As with other body
cues, there is complementarity between h and f and we thus
suggest using them together.

3.6. Additional training data (hcacd, hcasia)

It is well known that deep learning architectures benefit
from additional data. PIPER’s DeepFace is trained
over 4.4 · 10^6 faces of 4 · 10^3 persons (the private SFC
dataset [38]). In comparison our cues are trained over ImageNet
and PIPA’s 29 · 10^3 faces over 1.4 · 10^3 persons.
To measure the effect of training on larger data we consider
fine-tuning using two open face recognition datasets:
CASIA-WebFace (CASIA) [40] and the “Cross-Age Reference
Coding Dataset” (CACD) [6].


                                          Method                    Accuracy
More data (§3.6)    (head region)         h                         83.88
                                          h + hcacd                 84.88
                                          h + hcasia                86.08
                                          h + hcasia + hcacd        86.26
Attributes (§3.7)   (head region)         hpipa11m                  74.63
                                          hpipa11                   81.74
                                          h + hpipa11               85.00
                    (upper body region)   upeta5                    77.50
                                          u + upeta5                85.18
                    (head+upper body)     A = hpipa11 + upeta5      86.17
                                          h + u                     85.77
                                          h + u + A                 90.12

Table 3: Validation set accuracy of different cues based on
extended data. See §3.6 and §3.7 for details.

CASIA contains 0.5 · 10^6 images of 10.5 · 10^3 persons
(mainly actors and public figures), and is (to the best of our
knowledge) the largest open dataset for face recognition.
When fine-tuning AlexNet over these identities (using the
head area h), we obtain the hcasia cue.

CACD contains 160 · 10^3 faces of 2 · 10^3 persons with
varying ages. Although smaller than CASIA, CACD features
a greater number of face examples per subject (~2×).
The hcacd cue is built via the same procedure as hcasia.

Results The improvement of h + hcacd and h + hcasia
over h shows that cues from outside training data are complementary
to h (see top part of Table 3). hcacd and hcasia
on their own are about 5 pp worse than h. hcacd and
hcasia exhibit slight complementarity.

Conclusion Adding more data, even from a different type
of photos, is an effective means to improve the performance.

3.7. Attributes (hpipa11, upeta5)

Albeit overall appearance might change day to day, one 
could expect that long term attributes provide means for 
recognition. We thus explore building feature vectors by 
fine-tuning AlexNet not on person recognition (like for all 
other cues), but rather for attributes classification as a surrogate
task. We consider two sets of annotations.

We have annotated the PIPA train and validation sets 
(1409 + 366 identities) with five long term attributes: age, 
gender, glasses, hair colour, and hair length (11 binary bits 
total; see Appendix §I for details). We use the h crops to
build hpipa11, as the attributes are head centric.

We also consider using the “PETA pedestrian attribute
dataset” [11], which features 105 attributes annotations for
19 · 10^3 full-body pedestrian images. Out of 105 we chose
the five binary attributes that are long term and are well represented
in PETA: gender, age (young adult, adult), black

                      Method                  Accuracy
                      Chance level             0.17
Body                  GlobalModel [41]        67.60
                      b                       69.63
Head                  DeepFace [41]           46.66
                      h                       76.42
Extended data         h + hcasia + hcacd      79.63
                      PIPER [41]              83.05
Head+Body             h + b                   83.36
Full person           P = f + h + u + b       85.33
Full image            Ps = P + s              85.71
Extended data         naeil = Ps + E          86.78
Combining with [41]   PIPER [41] + P          87.67
                      PIPER [41] + naeil      88.37

Table 4: Test set accuracy of different cues and their combinations
under the original PIPA evaluation protocol.
Extended data E = hcasia + hcacd + hpipa11 + upeta5.

hair, and short hair (details in Appendix §I). Since upper
body u is less noisy than the full body b (see Table 1), upper
body crops of PETA are used to fine-tune AlexNet. The
upeta5 cue is thus built.

Results For attribute fine-tuning we consider two approaches:
training a single network for all attributes (multi-label
classification with sigmoid cross entropy loss), or tuning
one separate network per attribute (softmax loss) and
then concatenating their feature vectors. The results on validation
data indicate the second choice (hpipa11) performs
better than the first (hpipa11m).

Table 3 (bottom) shows that attribute classification as
surrogate task does help person recognition. Both PIPA
(hpipa11) and PETA (upeta5) annotations behave similarly
(~1 pp gain over h and u), and show good complementarity
among themselves (~5 pp gain over h + u). Amongst
the attributes considered, gender contributes the most to improve
recognition accuracy (for both attributes datasets).

Conclusion Adding attributes classification as a surrogate 
task improves performance. 

4. Test set results 

All experiments in this paper are limited to a person
recognition scenario where head boxes are provided by human
annotations, and all test faces belong to a known finite
set. Table 4 reports the performance on the test set of the different
cues described in previous sections. We study their
complementarity to each other, and compare them against
the PIPER components [41]. A more detailed table and
the corresponding validation set results are included in Appendix
Table 6.

We also report computational times for some pipelines
in our method. The feature training takes 2-3 days on a
single GPU machine. The SVM training takes 42.20s for h
(4096 dim) and 1303.30s for naeil (4096 × 17 dim) on the
Original split (581 classes, 6443 samples). Note that this
corresponds to a realistic user scenario in a photo sharing
service where ~500 identities are known to the user and
the average number of photos per identity is ~10.

Compared to PIPER, our framework is computationally
efficient in two aspects. First, our system does not need
to learn to assign weights for different cues. Second, the
PIPER feature has roughly 4096 × 108 dimensions, requiring
far more memory and training time than our final system
(4096 × 17 dim).
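
The feature sizes quoted above can be checked with a few lines of arithmetic. The breakdown of the 17 blocks is our reading of the text rather than something the paper states explicitly: four person regions, one scene cue, two extended-data head cues, and one network per attribute for each of the two five-attribute sets. The figure of 108 blocks for PIPER is taken from the text as given.

```python
FC7_DIM = 4096  # seventh-layer AlexNet feature size, per cue

# Number of 4096-d blocks per cue group (assumed breakdown, see lead-in).
blocks = {
    "P (f, h, u, b)": 4,
    "s": 1,
    "hcasia, hcacd": 2,
    "hpipa11 (one network per attribute)": 5,
    "upeta5 (one network per attribute)": 5,
}

naeil_blocks = sum(blocks.values())
naeil_dim = FC7_DIM * naeil_blocks
piper_dim = FC7_DIM * 108  # "roughly 4096 x 108", as stated in the text
hb_dim = FC7_DIM * 2       # h + b uses two blocks

print(naeil_blocks, naeil_dim)          # 17 69632
print(round(piper_dim / naeil_dim, 2))  # 6.35
print(round(piper_dim / hb_dim, 1))     # 54.0
```

Under this breakdown the PIPER feature is roughly 6 times larger than the naeil feature and roughly 50 times larger than h + b.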

Body b Considered alone, our body cue b is a reimplementation
of PIPER’s GlobalModel [41]. As expected,
we obtain a similar accuracy.

Head h On the other hand, our head cue h is more effective
than the corresponding PIPER’s DeepFace. As
discussed in §3.5, we have observed that: a) for this task
the head region is more informative than the face (focusing
on the face region is detrimental); b) our approach is much
more robust for non-frontal faces (~50% of test cases),
where h reaches 70% accuracy, while DeepFace becomes
uninformative in this case. When extending the training
data our head performance further improves (see also the
discussion in §5.4).

Head+body h+b Our minimal system matching PIPER’s
performance is h + b, with accuracy 83.36%. Note that
the feature vector of h + b is about 50 times smaller than
PIPER’s.

In principle PIPER captures the head region via one of
its poselets. Thus, h + b extracts cues from a subset of
PIPER’s “GlobalModel+Poselets” [41], which only
reaches 78.79%.

Full person P Similar to the validation set results, having
more cues further improves results. P = f + h + u + b obtains
a clear margin over PIPER, yet is a simpler system (neither
specialised face recognition nor pose estimation used) built
with less training data (only PIPA for fine-tuning, ImageNet
for pre-training, and the face detector training set).

naeil Adding scene s (§3.4) and extended data E (§3.6
& §3.7) contributes the last 1 percent point. We name our
final method naeil^1. Its feature vector is 6 times smaller
than PIPER’s, and it provides the best known results on the
PIPA dataset.

Figures 1 and 7 show some example results of our system.
§5.4 analyses the remaining hard test cases.

4.1. Complementarity between PIPER and naeil

Since PIPER uses different training data than naeil
we can expect some complementarity between the two
methods. For experiments, we use the PIPER scores provided
by the authors of [41]. Note, however, that the
PIPER features are unavailable. By averaging the output
scores of the two methods (PIPER + naeil) we gain
~1.5 percent points, reaching 88.37%. Using a more sophisticated
strategy might provide more gain, but we already
see that naeil covers most of the performance from
PIPER.

^1 “naeil” means “tomorrow” and sounds like “nail”.
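
The score averaging used to combine the two methods amounts to a simple late fusion. The sketch below is illustrative only: `average_scores` and `top_identity` are hypothetical names, and it assumes both methods output one score per identity on comparable scales (the text does not say whether any normalisation was applied).

```python
from typing import Dict


def average_scores(
    scores_a: Dict[str, float], scores_b: Dict[str, float]
) -> Dict[str, float]:
    """Average two methods' per-identity scores for one test instance."""
    return {name: (scores_a[name] + scores_b[name]) / 2.0 for name in scores_a}


def top_identity(scores: Dict[str, float]) -> str:
    """Identity with the highest score."""
    return max(scores, key=lambda name: scores[name])
```

In this toy setting, an identity that neither method ranks first on its own can still win after averaging if both methods give it a reasonably high score, which is where complementarity shows up.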

4.2. Towards an open world setting 

All experiments in this paper are limited to a person
recognition scenario where head boxes are provided by human
annotations, and all test faces belong to a known finite
set. Not providing ground truth heads at test time is an
arguably more realistic and challenging scenario in which
both person detection and recognition need to be solved
jointly.

Using a face detector (§3.5) as our person detector over
the test set, we reach ~78% recall at (average) ten detections
per image (~70% at 3 detections/image). If we use
naeil to label these faces, we reach ~65% recall on the
test0/1 identities (~62% at 3 detections/image).

The performance drops, but less dramatically than what 
one might expect. It remains as future work to implement a 
detailed evaluation in the open world setting. 

4.3. A naive baseline 

Given the inherent difficulty of the PIPA person recognition
task (see Figure 7) reaching ~85% accuracy seems
suspiciously high. Thus, we investigate the issue using a
crude baseline hrgb that takes the raw RGB pixel values of
the head area as features (after downsizing to 40 × 40 pixels
and blurring), and uses a nearest neighbour classifier. By
design hrgb is only able to recognize near identical heads
across the test0/1 split, yet it reaches a surprisingly high
33.77% (49.46%) accuracy on the test set (validation set).
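
The classification step of such a raw-pixel baseline fits in a few lines. This is a sketch under assumptions: `nearest_neighbour` is a hypothetical name, the head crops are assumed to be already downsized, blurred and flattened into pixel vectors, and squared Euclidean distance is an assumption since the text does not name the metric.

```python
from typing import Sequence, Tuple


def nearest_neighbour(
    query: Sequence[float],
    gallery: Sequence[Tuple[Sequence[float], str]],
) -> str:
    """Return the label of the gallery vector closest to `query`.

    Distance is squared Euclidean on raw pixel values (an assumption).
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    _, best_label = min(gallery, key=lambda item: sq_dist(query, item[0]))
    return best_label
```

With a 40 × 40 RGB crop, each vector would have 4 800 entries; the baseline can only succeed when a near-duplicate of the test head exists in the other half of the split.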

Conclusion About 1/3 of the original PIPA test splits is
easy to solve. This motivates us to explore more realistic 
splits and protocols. In the next section we discuss the issue 
and propose solutions via new test splits. 

5. Analysis of person recognition challenges 

This section provides a detailed analysis of the obtained 
results and shares insights on addressing future challenges. 

As we have seen in §4.3, the current setup includes many
easy examples, limiting us from exploring more difficult dimensions
of the problem. Accordingly, we propose three
new test0/test1 splits of PIPA in §5.1. Based on the new
splits, we analyse the robustness of different cues across appearance
changes (§5.2). We then discuss the effect of the
amount of person specific training data (§5.3), and provide
a failure mode analysis in §5.4.


Figure 3: Visualisation of Original and Day splits for one
identity (panels: Original split test0, Original split test1,
Day split test0, Day split test1). Greater appearance changes
are observed across the Day split.

5.1. New PIPA splits with varying difficulty and 
challenges 

We have seen a strong performance of our main system
naeil (86.78% on test set, Table 4) and the baseline hrgb
(33.77% on test set, §4.3) despite the challenging task of
person recognition in photo albums. This motivates us to
investigate more difficult and realistic setups.

Limitations of original setup The main limitation of the
original PIPA protocol is that the test0/test1 splits are even-odd
instances from a sample list that largely preserves the
photo orders in albums. When photos are taken in a short
period of time, adjacent photos can be nearly identical.
However, a main challenge in person recognition is to generalise
across long-term appearance changes of a person;
we thus introduce a range of new splits on PIPA in the order
of increasing difficulty:

Original split O: We keep the original split in our study 
for comparison. The split is on the odd vs even basis. 

Album split A: All samples are organised by albums.
This split assigns for each person identity samples from separate
albums, while keeping the number of samples equal
for the splits. Since it is not always possible to satisfy both
conditions, a few albums are shared between the splits. In
this setup, training and test samples are split across different
events and occasions.

Figure 4: Recognition accuracy
across different experimental setups
on the test set.


Figure 5: Test set accuracy of cues in 
different settings, relative to naeil. 


Figure 6: Recognition accuracy at 
different sizes of training examples. 


Time split T: This split investigates the temporal dimension
of the photos. For each person identity, we sort all
photos by their “photo-taken-date” metadata. We split them 
into newest versus oldest images. The instances without 
time metadata are distributed evenly. This split emphasises 
the temporal distance between training and test. 


Day split D: The Time split T does not always produce a time
gap: many people appear at only one event, and the time metadata
are often missing. We thus make the split manually according
to two conditions: either firm evidence of a date change
between the splits (change of season, continent, event, or
co-occurring people), or visible changes in appearance (hairstyle,
make-up, head or body wear). These rules enforce “appearance
changes”. For each identity, we randomly discard
instances from the larger test set until sizes match. If there
are fewer than 5 instances in the split, we discard the identity
altogether (Original split applies the same criterion). After
pruning, 199 identities (out of 581) were left, with about
20 training samples per identity (similar range as all other
splits).
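
The two splits that are fully mechanical, Original and Time, can be sketched as follows for the instances of a single identity. This is an illustration under assumptions: the function names are hypothetical, timestamps are plain integers, which half plays the role of test0 is left open, and alternating the undated instances between the halves is only one possible reading of “distributed evenly”. The Album and Day splits need album metadata and manual judgement and are not covered.

```python
from typing import List, Optional, Tuple

Instance = Tuple[str, Optional[int]]  # (instance id, photo-taken timestamp or None)


def original_split(instances: List[Instance]) -> Tuple[List[Instance], List[Instance]]:
    """Even/odd split over the sample list, preserving its order."""
    return instances[0::2], instances[1::2]


def time_split(instances: List[Instance]) -> Tuple[List[Instance], List[Instance]]:
    """Oldest half versus newest half, sorted by timestamp.

    Instances without a timestamp are assigned alternately to the two
    halves (an assumption of this sketch).
    """
    dated = sorted((i for i in instances if i[1] is not None), key=lambda i: i[1])
    undated = [i for i in instances if i[1] is None]
    half = len(dated) // 2
    oldest, newest = dated[:half], dated[half:]
    for k, inst in enumerate(undated):
        (oldest if k % 2 == 0 else newest).append(inst)
    return oldest, newest
```

When the sample list follows album order, the even/odd split places temporally adjacent photos in opposite halves, whereas the time split maximises the temporal distance between them.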


Results Figure 4 provides an overview of how the raw
colour baseline hrgb and our naeil approach perform
across different splits. We observe that the unreasonably
good performance of the hrgb baseline consistently degrades
from the Original over Album and Time to the Day
splits, indicating the increasing amount of non-trivial recognition
tasks. Compared to the drop of the hrgb baseline to about
1/5 of its Original split accuracy (33.77% to 6.78%), naeil’s
performance is less impaired (86.78% to 46.48%), indicating
naeil’s ability to address more realistic scenarios
characterised by changes in appearance, location and time.


5.2. Importance of features 

To gain a deeper understanding of the relative importance of
different cues and their robustness across splits, we consider
Figure 5 which shows the results normalised by the performance
of naeil (100%). This allows us to analyse which
features maintain, lose or gain discriminative power when
moving from the easier to the more challenging settings.
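
The normalisation is a plain ratio. As a worked example, the sketch below applies it to three test set accuracies from Table 4 (Original split); the helper name `relative_to` is hypothetical, and the per-split values plotted in Figure 5 are not reproduced here.

```python
# Test set accuracies (Original split) taken from Table 4.
accuracy = {"b": 69.63, "h": 76.42, "naeil": 86.78}


def relative_to(reference: str, acc: dict) -> dict:
    """Express every accuracy as a percentage of the reference method's."""
    ref = acc[reference]
    return {name: 100.0 * value / ref for name, value in acc.items()}


relative = relative_to("naeil", accuracy)
for name, value in relative.items():
    print(f"{name}: {value:.1f}%")
```

On the Original split, h alone retains roughly 88% of naeil’s accuracy and b roughly 80%; repeating the computation per split shows which cues lose or gain ground as the setting gets harder.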

We observe the strongest drops in relative performance
for body and upper body features, due to the loss of discriminability
of global features (e.g. clothing). We see consistent
gains for using surrogate training tasks such as attributes
(hpipa11, upeta5) and, more prominently, external
data for head features (hcasia, hcacd). External data for
head features particularly pays off for the most difficult Day
split.

Conclusion The usage of significantly larger databases
improves the robustness of our features, enabling recognition
in the most challenging scenarios.

5.3. Importance of training data 

We also investigate how much collecting more data from
each person identity can help to improve performance. In
Figure 6 we compare the Original to the Day split and show
performance for different sizes of training samples. While
on the Original split 80% performance is already reached
after 10 training examples, the performance on the Day split
improves relatively slowly and stays below 70%
with 25 samples (lagging 20% behind the Original split).

Conclusion Figure 6 shows that increasing the amount of
training data alone is unlikely to solve the harder Day split.
Better features and better methods are required.























Figure 7: Examples of success cases on the Original split. The first column shows test instances that our systems predict
correctly. Columns 5-7 show training instances of the correct identity. Columns 2-4 show training examples of the
identity that PIPER [41] wrongly predicts. From top to bottom, the test instances shown are: (1) success case of f + h and
failure case of PIPER; (2) success case of P = f + h + u + b and failure case of PIPER and f + h; (3) success case of P + s
and failure case of PIPER and P; and (4) success case of naeil and failure case of PIPER and P + s.


5.4. Analysis of remaining failure modes 

In Appendix §D we provide detailed statistics to study 
failure modes in the Original and Day splits. We discuss 
here the main findings. 

As expected, non-frontal faces are common failure cases
for naeil in both the Original and Day splits (~50%). For
frontal faces, we observe a larger proportion of failures in
the Day split than in the Original split. Moreover, the
majority of failures correspond to large heads (height >
100 pixels), where good features can be extracted. To better
handle more realistic scenarios, it is thus important to
improve the recognition of frontal faces across diverse
settings and long time-spans.

Another interesting aspect is that while naeil on the 
Original split has only one identity (out of 581) which is 
never correctly predicted, on the Day split the proportion of 
never correct identities jumps to 20%. This suggests that 
there are inherently difficult identities that our simplistic 
system currently cannot handle. 
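
The per-identity statistic discussed above can be computed as in the following sketch. The function names and the toy labels are hypothetical and only illustrate the bookkeeping; they are not PIPA data.

```python
from collections import defaultdict


def per_identity_accuracy(true_ids, predicted_ids):
    """Return {identity: fraction of its test instances predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction in zip(true_ids, predicted_ids):
        total[truth] += 1
        correct[truth] += int(truth == prediction)
    return {identity: correct[identity] / total[identity] for identity in total}


def never_correct_fraction(true_ids, predicted_ids):
    """Proportion of identities for which no test instance is recognised."""
    accuracy = per_identity_accuracy(true_ids, predicted_ids)
    return sum(1 for a in accuracy.values() if a == 0.0) / len(accuracy)


# Toy labels (hypothetical): identity "c" is never recognised.
truth = ["a", "a", "b", "b", "c", "c"]
predictions = ["a", "b", "b", "b", "a", "b"]
print(per_identity_accuracy(truth, predictions))
print(never_correct_fraction(truth, predictions))
```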


6. Conclusion

We analysed the problem of person recognition in photo
albums where people appear with various viewpoints, poses
and occlusions. There are four major conclusions from our
studies. First, the head region, even when the face is not
visible, is a strong cue for person recognition, better than
the face region itself (§3.5). Second, different cues, although
from overlapping regions, are complementary (§4). Third,
feature learning with a massive database of faces improves
robustness across time and appearance (§5.2). Fourth, simply
increasing the number of training examples per person does
not automatically solve the problem, and better recognition
systems must be devised (§5.3).

One possible research direction is collecting a large 
database of personal photo albums on which better features 
can be trained. One can also exploit album context, which is 
a rich source of identity information [16, 35, 36]; however, 
it was not used in this work for fair comparison. 

Our experimental data, including the new splits, trained 
models, naeil results, and attribute annotations are pub¬ 
lished at http://goo.gl/DKuhlY. 
Appendix

A. Content 

This appendix provides additional qualitative and quan¬ 
titative details of the experiments and results discussed in 
the main paper. It includes visualisations of newly pro¬ 
posed splits (§B), success and failure examples of our sys¬ 
tems (§C,D), detailed validation and test set tables (§F), and 
other technical details for the experiments (§E, G, H, I). 

B. More examples of splits 

We provide more examples of the proposed splits on the
PIPA dataset (Figure 10). The separation of appearances
across splits becomes clearer as we shift from the Original
to the Day split.

C. More success and failure examples 

We provide additional qualitative examples of success
and failure cases of our systems. Figures 11 to 13 show
test instances (single images on the left) and training
instances (triple images on the right) of the identity that the
system predicted. The three training instances are ordered
by increasing L2 feature distance from the test image. The
ticks and crosses denote whether the system's prediction
was correct or not. Note the symmetry of the left and right
columns: the left columns are cases that our systems (f + h,
P = f + h + u + b, naeil) predicted correctly while PIPER
did not, and the right columns present the reverse case.

We also provide examples where neither naeil nor 
PIPER correctly predicted (Figure 14), which correspond 
to 9.35% of the whole test set. One can observe inter¬ 
personal confusion due to similar clothing and back¬ 
ground similarity (left top/middle, right top/middle), se¬ 
vere occlusions of body regions in the test image (left 
top/middle/bottom), and an annotation error (right bottom; 
note the marathoner’s front number). 

D. Failure modes 

We also provide an auxiliary visualisation to the obser¬ 
vations in §5.4. The top three plots in Figure 8 show the 
distribution of naeil’s failure cases with respect to three 
different factors: head orientation, resolution, and the body 
crop truncation. The bottom plot analyses per identity ac¬ 
curacy of naeil. 

Head orientation For head orientation, we see that failure
cases indeed have a greater proportion of non-frontal
faces compared to the entire test set distribution. However,
it can also be deduced that the failure cases are less
correlated with head orientation in the Day split setting, from
the fact that the failure distribution in the Day split deviates
less from the entire population's distribution than does its
Original split counterpart.

Body crop truncation In the body crop truncation plot, 
we observe a homogeneity between Original and Day split 
distributions, and that having less image content in the body 
crop is indeed detrimental to the recognition in both Origi¬ 
nal and Day splits. 

Resolution The resolution plot shows how the resolution
of a person instance is related to naeil's ability to
recognise the person. Note that head height was measured, since
all body crops (which excludes the scene s) are
proportional to the head size. Under the Day split, naeil's
failure cases contain a greater proportion of medium resolution
heads ([100, 200) pixels) than of lower resolution heads, while
the entire population has a greater proportion of lower
resolution heads. In other words, resolution is not positively
correlated with naeil's performance under the Day split. Hence,
for example, picturing a person at a closer distance is not
likely to greatly improve recognition across days.

Per identity accuracy The final plot shows naeil's
per identity performance. Note the increase in the
proportion of never-identified individuals (leftmost bin) from the
Original split to the Day split. This suggests that under
the Day split there exists a meaningful number of identities
which naeil cannot currently handle.

E. Face detector details 

For face detection we use the state of the art DPM
detector from [32]. This detector is trained on ~15k faces
from the AFLW database and is composed of 6 components,
which give a rough indication of face orientation: ±0°
(frontal), ±45° (diagonal left and right), and ±90° (side
views). Figure 15 shows example face detections on the
PIPA dataset: the detections, the estimated orientation,
the regressed head bounding box, the corresponding ground
truth head box, and some failure modes. Faces corresponding
to ±0° are considered frontal, and all others (±45°,
±90°, and non-detected) are considered non-frontal. No
ground truth is available to evaluate the face orientation
estimation; apart from a few mistakes, however, the ±0°
components seem to be rather reliable estimators (while more
confusion is observed between ±45° and ±90°).
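
The frontal / non-frontal rule described above can be written as a small helper. This is a sketch under our own assumptions: the detector output is represented by the signed angle of the firing component in degrees, with `None` standing for a missed detection; the actual detector interface is not specified here.

```python
from typing import Optional

# Angles (in degrees) associated with the DPM components; encoding is assumed.
KNOWN_ANGLES = {-90, -45, 0, 45, 90}


def is_frontal(component_angle: Optional[int]) -> bool:
    """Apply the rule from the text: only the 0-degree components count as frontal.

    `None` means that no face was detected, which is treated as non-frontal.
    """
    if component_angle is None:
        return False
    if component_angle not in KNOWN_ANGLES:
        raise ValueError(f"unexpected component angle: {component_angle}")
    return component_angle == 0


for angle in (0, 45, -90, None):
    print(angle, is_frontal(angle))
```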

F. Detailed results 

F.1. Detailed validation set results

See Table 5 for detailed results on the validation set. It
also shows the increase in performance as we zoom out
from the face (f) or zoom in from the scene (s). When we
zoom out, we already gain most of the identity information
from the face to upper body regions, and the rest contributes
only marginally. As we zoom in, a rather gradual improvement
is observed. It is notable, however, that s + b (82.16%) is
already almost as effective as its two-part counterpart in
the zoom-out scenario, f + h (84.80%).

Figure 8: Top three: distribution of instances with respect to
failure factors. Bottom: distribution of identities according
to naeil's performance for each identity.

Figure 9: Validation set performance of different cues, as a
function of the fine-tuning duration.

F.2. Detailed test set results

See Table 6 for the test set results for different
experimental setups. Note that the addition of external data,
including the attribute cues (PIPA attributes and PETA
attributes) and large face databases (CACD and CASIA), is
especially effective in the Day split setting: from 42.31%
by P_s = P + s to 46.54% by naeil = P_s + E.

G. How much fine-tuning?

Task Unless otherwise stated, we fine-tune the ImageNet
pre-trained AlexNet [24] on the PIPA person recognition
train set. The initial weights of the AlexNet are obtained
by training on ImageNet for object classification, and are
further optimised by training with different region crops of
PIPA train set images for the identity classification task.

Number of iterations Figure 9 verifies that 300k
iterations with mini-batch size 50 gives maximal, or close to
maximal, performance for most cues. In fact the plateau
is already reached at 100k, but we use 300k as a precaution.
Note that we do not observe any over-fitting behaviour.

In the main paper we report the results of fine-tuning for the
scene region s; this is the only region that does not show a
large gain due to fine-tuning.

The results of Figure 9 are obtained by training and test¬ 
ing SVMs on the PIPA validation set original splits, using 
AlexNet features obtained via fine-tuning on the PIPA train¬ 
ing set. 

Implementation details Our implementation uses the
Caffe library [23]¹ and the provided AlexNet model
bvlc_alexnet.caffemodel.

For fine-tuning, we use the following training configuration
parameters:

(prototxt for solver configuration) 
base_lr: 0.0001 
lr_policy: "step" 
gamma: 0.1 
stepsize: 50000 
momentum: 0.9 
weight_decay: 0.0005 

(prototxt for net specification) 
batch_size: 50 
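
To make the solver settings above concrete, the sketch below evaluates the learning rate over the course of fine-tuning. It assumes Caffe's documented "step" policy, lr = base_lr * gamma^floor(iter / stepsize); the function name is our own.

```python
def step_learning_rate(iteration, base_lr=1e-4, gamma=0.1, stepsize=50000):
    """Learning rate under a "step" policy: multiply by gamma every `stepsize` iterations."""
    return base_lr * gamma ** (iteration // stepsize)


# Learning rate at a few points of the 300k-iteration schedule.
for iteration in (0, 49999, 50000, 100000, 299999):
    print(f"iteration {iteration:>6}: lr = {step_learning_rate(iteration):.1e}")

# Number of training crops processed in total (iterations x batch size).
total_samples = 300000 * 50
print("samples seen:", total_samples)
```

With these values the learning rate is reduced five times before iteration 300k, so the last 50k iterations run at a very small step size; this is compatible with the plateau reported from 100k iterations onwards.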

Regarding the per-identity SVMs, we fix the SVM
parameter C at 1 throughout the paper. Preliminary experiments
indicated that this was not a sensitive parameter.

¹ https://github.com/BVLC/caffe








































H. SVM versus NN 


Table 7 compares the validation set accuracy of different 
cue combinations, when using (per-identity) SVM classifi¬ 
ers or a nearest neighbour (NN) classifier. 

The results show that using an SVM per identity is con¬ 
sistently better than a naive nearest neighbour classifier. 
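
For reference, the nearest neighbour baseline can be sketched as follows. This is a minimal illustration in plain Python that assumes features are given as lists of floats and that L2 distance is used; the function names and the toy features are hypothetical and do not reproduce the experimental code.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbour_identity(test_feature, train_features, train_ids):
    """Predict the identity of the training instance closest to the test feature."""
    best_index = min(
        range(len(train_features)),
        key=lambda i: l2_distance(test_feature, train_features[i]),
    )
    return train_ids[best_index]


# Toy 2-D features (hypothetical) for two identities.
train_features = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
train_ids = ["a", "a", "b"]
print(nearest_neighbour_identity([4.0, 4.0], train_features, train_ids))
print(nearest_neighbour_identity([0.2, 0.1], train_features, train_ids))
```

Unlike a per-identity SVM, this rule relies on a single closest training instance and learns no weighting of the feature dimensions.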

I. Attributes 

Table 8 shows the definitions of attribute classes that we 
annotated on PIPA head crops. We did not annotate attrib¬ 
utes for identities (1) whose appearances are indecisive for 
attribute classification (e.g. gender), and (2) whose attrib¬ 
utes change in PIPA (e.g. sunglasses). We will release the 
annotations. 

For upper body crops, we use the PETA dataset [11]
and five selected binary attribute annotations (out of 105),
namely age (from 15 to 30), age (from 30 to 45), gender,
black hair, and short hair. The selection is based on (1)
enough training samples (> 25% of PETA for both the
positive and negative classes), (2) upper body related
attributes, and (3) attributes that conventionally persist
across a day.
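
Criterion (1) is the only one of the three that can be checked mechanically. A sketch of such a check is given below; the attribute names and positive-class fractions are hypothetical placeholders for illustration, not PETA statistics.

```python
def is_balanced(positive_fraction, min_fraction=0.25):
    """True if both the positive and the negative class exceed `min_fraction` of the data."""
    negative_fraction = 1.0 - positive_fraction
    return positive_fraction > min_fraction and negative_fraction > min_fraction


# Hypothetical positive-class fractions, for illustration only.
candidates = {"attr_balanced": 0.40, "attr_rare": 0.05, "attr_common": 0.90}
selected = [name for name, fraction in candidates.items() if is_balanced(fraction)]
print(selected)
```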

For detailed results on the attributes, see Table 9. We 
note that the gender cues (both PIPA and PETA) give the 
greatest performance gain. 


Cue                                        Accuracy

Chance level                               0.27
Scene          s                           27.06
Body           b                           80.81
Upper body     u                           84.76
Head           h                           83.88
Face           f                           74.45

Zoom out       f                           74.45
               f + h                       84.80
               f + h + u                   90.65
               f + h + u + b               91.14
               f + h + u + b + s           91.16

Zoom in        s                           27.06
               s + b                       82.16
               s + b + u                   86.39
               s + b + u + h               90.40
               s + b + u + h + f           91.16

Head+body      h + b                       89.42
Face+head      f + h                       84.80
               f + h + u                   90.65
               f + h + b                   90.19
Full person    P = f + h + u + b           91.14
Full image     P_s = P + s                 91.16

Table 5: Validation set accuracy of different cues.

Method                 Original   Album    Time     Day

Chance level           0.17       0.17     0.17     0.50
h_rgb                  33.77      27.19    16.91    6.78
s                      24.71      19.89    12.83    8.67
b                      69.63      59.29    44.92    20.41
h                      76.42      67.48    57.05    36.37
h + b                  83.36      73.97    63.03    38.12
P = f + h + u + b      85.33      76.49    66.55    42.14
P_s = P + s            85.71      76.68    66.55    42.24
naeil = P_s + E        86.78      78.72    69.29    46.61
PIPER [41]             83.05      -        -        -

Table 6: Recognition accuracy across different experimental
setups (columns) on the test data.
Extended data E = h_casia + h_cacd + h_pipa11 + u_peta5.

Cue            Method               Accuracy

Head           h                    83.88
               h_nn                 74.92
Head+Body      h + b                89.42
               (h + b)_nn           79.63
Full Person    P = f + h + u + b    91.14
               P_nn                 77.31

Table 7: Validation set accuracy using SVM versus NN.

Attribute    Classes        Criteria

Age          Infant         Not walking (due to young age)
             Child          Not fully grown body size
             Young Adult    Fully grown & Age < 45
             Middle Age     45 < Age < 60
             Senior         Age > 60
Gender       Female         Female looking
             Male           Male looking
Glasses      None           No eyewear
             Glasses        Glasses without eye occlusion
             Sunglasses     Glasses with eye occlusion
Haircolour   Black          Black
             White          Any hint of whiteness
             Others         Neither of the above
Hairlength   No hair        No hair on the scalp
             Less hair      Hairless for > 1/2 upper scalp
             Short hair     When straightened, < 10 cm
             Med hair       When straightened, < chin level
             Long hair      When straightened, > chin level

Table 8: PIPA attributes details.

Cue                Method                  Accuracy

Head               h                       83.88
                   h_pipa11m               74.63
                   h_pipa11                81.74
                   h + h_pipa11            85.00
                   h + h_age               84.40
                   h + h_gender            84.69
                   h + h_glasses           84.30
                   h + h_haircolour        84.25
                   h + h_hairlength        84.39
Upper Body         u                       84.76
                   u_peta5m                75.71
                   u_peta5                 77.50
                   u + u_peta5             85.18
                   u + u_age1              84.75
                   u + u_age2              84.81
                   u + u_gender            84.90
                   u + u_hairshort         84.87
                   u + u_hairblack         84.80
Head+Upper Body    h_pipa11 + u_peta5      86.17
                   h + u                   85.77
                   h + u + A               90.12

Table 9: Validation set accuracy of different attribute cues.
Figure 10: Example of different split types over three identities.







































Figure 11: Success and failure cases of f + h under the Original split. Left column: PIPER fails, f + h recognizes correctly.
Right column: the inverse case.

Figure 12: Success and failure cases of P = f + h + u + b under the Original split.

Figure 13: Success and failure cases of naeil under the Original split.


























































Figure 14: Failure examples of both naeil and PIPER under the Original split. 





























































Figure 15: Example results from the face detector (PIPA validation set), and estimated head boxes.
Panels: (a) -90°, (b) +90°, (c) -45°, (d) +45°, (f) missing detections, (g) legend.
Legend entries: matched ground truth head, detected face, head estimated from face, missed ground truth head.



























